Import all relevant libraries
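A typical import cell for this kind of analysis might look as follows (a sketch assuming the standard PyData stack; the original notebook may also import seaborn, xgboost, and catboost):

```python
# Core data handling
import numpy as np
import pandas as pd

# Visualisation
import matplotlib.pyplot as plt

# Modelling utilities from scikit-learn
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.metrics import accuracy_score, classification_report

import warnings
warnings.filterwarnings("ignore")  # keep the notebook output clean
```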

CONTEXT: A telecom company wants to use its historical customer data to predict churn behaviour and retain customers. We can analyse all relevant customer data and develop focused customer retention programs.

  1. DATA DESCRIPTION: Each row represents a customer, and each column contains a customer attribute described in the column metadata. The dataset includes information about:
  2. PROJECT OBJECTIVE: Build a model that will help identify the customers who have a higher probability of churning. This helps the company understand the pain points and patterns of customer churn, and sharpens the focus on strategising customer retention.

1. Import and warehouse data:

Let's see how each dataset looks

After looking at the data, it seems we already have a merged dataset available.

There are no repeated columns in the dataset, and every record is present in the merged dataset.

Let's examine each dataset now:

Inference

Let's confirm if we have a merged dataset with us

All the customer values are present in the merged dataset, hence we shall continue our process with the merged dataset only.
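A quick way to confirm this, sketched on hypothetical miniature versions of the two source tables (the table names, columns, and IDs here are illustrative, not from the actual data):

```python
import pandas as pd

# Hypothetical miniature versions of the two source tables
accounts = pd.DataFrame({"customerID": ["A1", "A2", "A3"],
                         "tenure": [1, 34, 2]})
services = pd.DataFrame({"customerID": ["A1", "A2", "A3"],
                         "InternetService": ["DSL", "DSL", "Fiber optic"]})

# An outer merge with indicator=True flags rows missing from either side
merged = accounts.merge(services, on="customerID", how="outer", indicator=True)
print((merged["_merge"] == "both").all())  # True -> every customer appears in both tables
```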

2. Data cleansing:

It can be observed from the above that every record missing a TotalCharges value has a tenure of 0. Let's confirm it.
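One way to run that confirmation, sketched on a toy frame (in the real data TotalCharges is read in as text, with blanks for brand-new customers; the values below are made up):

```python
import pandas as pd

# Toy frame mimicking the pattern: blank TotalCharges where tenure is 0
df = pd.DataFrame({"tenure": [0, 5, 0, 12],
                   "TotalCharges": [" ", "350.5", " ", "900.0"]})

# Coerce the text column to numbers; blanks become NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Confirm every null TotalCharges belongs to a tenure-0 customer
print(df.loc[df["TotalCharges"].isna(), "tenure"].eq(0).all())  # True

# A new customer has paid nothing yet, so 0 is a defensible fill
df["TotalCharges"] = df["TotalCharges"].fillna(0)
```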

3. Data analysis & visualisation:

Checking the distribution of the numerical columns with respect to the target column.

Males and females tend to spend about the same in both categories: churning and non-churning.
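The check behind that observation can be sketched as a group-by on a toy frame (column names mirror the telco schema, but the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Churn":  ["Yes",  "Yes",    "No",   "No",     "No",   "Yes"],
    "MonthlyCharges": [80.0, 79.5, 50.0, 51.0, 49.0, 81.0],
})

# Mean spend by gender within each churn group
summary = df.groupby(["Churn", "gender"])["MonthlyCharges"].mean().unstack()
print(summary)
```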

Now let's check for any outliers and assess the impact they might have on the company.
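A standard IQR-based outlier check can be sketched as follows (the charge values here are invented, with two extremes planted deliberately):

```python
import numpy as np

# Made-up monthly charges; two extreme values planted as outliers
monthly = np.array([20, 25, 30, 45, 50, 55, 60, 70, 75, 80, 250, 300], dtype=float)

q1, q3 = np.percentile(monthly, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = monthly[(monthly < lower) | (monthly > upper)]
print(outliers)  # the two planted extremes
```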

The monthly charges for fiber optic are relatively higher, indicating a likely reason for the high churn rate among customers who opted for the Fiber Optic internet service.

Let's create a copy of the existing dataframe, encode all the categorical features, and check the correlation.
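That step can be sketched like this (a toy frame with invented values; the real notebook would run the same pattern over the full feature set):

```python
import pandas as pd

df = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "Month-to-month", "One year"],
    "tenure":   [2, 60, 5, 30],
    "Churn":    ["Yes", "No", "Yes", "No"],
})

# Work on a copy so the original frame stays untouched
df_enc = df.copy()
for col in df_enc.select_dtypes(include="object"):
    df_enc[col] = df_enc[col].astype("category").cat.codes

# Correlation of every encoded feature with the encoded target
print(df_enc.corr()["Churn"].sort_values())
```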

From all of the above, we can conclude that the below-mentioned features can be dropped.

4. Data pre-processing:

We saw that the target is imbalanced, hence we balance the classes before training.
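The notebook does not spell out the balancing technique; one simple option, sketched here, is random upsampling of the minority class with `sklearn.utils.resample` (SMOTE from the separate imbalanced-learn package is another common choice):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: far more non-churners than churners
df = pd.DataFrame({"tenure": range(10),
                   "Churn": ["No"] * 8 + ["Yes"] * 2})

majority = df[df["Churn"] == "No"]
minority = df[df["Churn"] == "Yes"]

# Upsample the minority class with replacement to match the majority count
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced["Churn"].value_counts())  # both classes now have 8 rows
```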

5. Model training, testing and tuning:

Decision Tree

It shows that the model is overfit: the training score is much better than the testing score. Let's visualize the decision tree.

Let's visualize how the tree looks without any parameter tuning.

It can be clearly visualized how overfit the model is.
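The overfit gap can be reproduced on synthetic data (`make_classification` here stands in for the encoded churn features; the notebook itself would also call `sklearn.tree.plot_tree` to draw the tree):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded churn features
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# An unpruned tree keeps splitting until every leaf is pure
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
print(tree.score(X_tr, y_tr))  # 1.0 on the training data
print(tree.score(X_te, y_te))  # noticeably lower on unseen data
```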

Let's see the results after pruning the decision tree.

Let's try to find the best parameters for the decision tree. We shall use 5-fold cross-validation for each selected set of hyperparameters.
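A sketch of that search on synthetic data (the parameter grid below is illustrative; the notebook's actual grid may differ):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid of pruning-related hyperparameters
param_grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 5, 10]}

# cv=5 -> 5-fold cross-validation for every hyperparameter combination
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```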

The score of the model improves with hyperparameter tuning.

Bagging

Applying bagging to the decision tree helps improve the accuracy.

Random Forest

XGBoost

CatBoost

AdaBoost

Gradient Boosting
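The ensemble family above can be compared in one loop; a sketch on synthetic data (XGBoost and CatBoost live in their own packages, so only the scikit-learn estimators are shown here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=2)

models = {
    "Random Forest": RandomForestClassifier(random_state=2),
    "AdaBoost": AdaBoostClassifier(random_state=2),
    "Gradient Boosting": GradientBoostingClassifier(random_state=2),
}

# Mean 5-fold cross-validated accuracy per model, best first
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```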

Pickle Model

The best model turned out to be GradientBoostingClassifier. Hence let's save it for further use.
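Saving and reloading the fitted model can be sketched as follows (the filename `churn_model.pkl` is illustrative, and synthetic data stands in for the real features):

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=3)
model = GradientBoostingClassifier(random_state=3).fit(X, y)

# Serialise the fitted model to disk
with open("churn_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Reload and confirm the round trip preserves predictions
with open("churn_model.pkl", "rb") as f:
    loaded = pickle.load(f)
print((loaded.predict(X) == model.predict(X)).all())  # True
```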

6. Conclusion and improvements: